Abstract


Accurately rendering reverberation is critical to produce realistic binaural audio, 
particularly in augmented reality applications where virtual objects must blend in 
seamlessly with real ones. However, rigorously simulating sound waves interacting 
with the auralised space can be computationally costly, sometimes to the point of 
being unfeasible in real-time applications on resource-limited mobile platforms.
Luckily, knowledge of auditory perception can be leveraged to make computational 
savings without compromising quality. This chapter reviews different approaches 
and methods for rendering binaural reverberation efficiently, focusing specifically 
on Ambisonics-based techniques aimed at reducing the spatial resolution of late 
reverberation components. Potential future research directions in this area are also 
discussed. 


Keywords: binaural audio, reverberation, auralisation, Ambisonics, perceptual
evaluation 


1. Introduction 


Reverberation results from pairing a sound source with an acoustic space. After 
emanating from the source, a sound wave will interact with its environment, 
undergoing reflection, diffraction and absorption. Thus, a listener will receive fil- 
tered replicas of the original wavefront (echoes) arriving from various directions at 
different times, causing the impression that the original sound persists in time. 
According to the so-called precedence effect, the direct sound allows a listener to 
determine the position of the sound source, while early reflections are generally not 
perceived as distinct auditory events [1-3]. As stated by Wallach et al. [1], the 
maximum delay after which a reflection is no longer ‘fused’ with the direct sound 
depends on the signal, being around 5 ms for single clicks and as long as 40 ms for 
complex signals such as speech or music [4]. Nevertheless, early reflections can 
broaden the perceived width of the source and shift its apparent position, as shown 
experimentally by Olive and Toole [5]. Furthermore, they can modify the signal’s 
spectrum due to phase cancellation and subsequent comb filtering, as shown by 
Bech in his study on small-room acoustics [6]. Such phenomena can alter the 
perception of the room on a higher level. For example, Barron and Marshall [7] 


1 IntechOpen 


Advances in Fundamental and Applied Research on Spatial Audio 


argued that the timing, direction of arrival, and spectra of early lateral reflections 
contribute to the sense of ‘envelopment’, defined as the ‘subjective impression of
being surrounded by the sound’. The time delay between the direct sound and the 
first distinct echo has also been shown to be a relevant feature: in the case of small 
rooms, Kaplanis et al. [8] found that it was correlated with the perception of 
environment dimensions and ‘presence’—or ‘sense of being inside an enclosed 
space and feeling its boundaries’—while in the case of concert halls, Beranek [9] 
linked it to a sense of ‘intimacy’. 

As time passes and the sound waves that emanated from the source continue 
interacting with the environment, the temporal density of echoes increases, and the 
resulting sound field becomes more diffuse. At this point, temporal and spatial 
features of individual echoes become less relevant, and late reverberation can be 
characterised as a stochastic process. An important parameter used to define such
a process is the reverberation time (RT), or the ‘duration required for the
space-averaged sound energy density in an enclosure to decrease by 60 dB after the
source emission has stopped’ [10], which is generally proportional to the volume of
the room. Yadav et al. [11] suggested that RT contributes to the perception of
environment dimensions most significantly in large spaces, whereas early reflections
have greater importance in small rooms. Although late reverberation is often modelled
as diffuse and isotropic (i.e., with an even distribution of energy across directions
from the listener’s point of view), Alary et al. [12] showed that this assumption may
not always hold and directionality should be taken into account, especially for 
asymmetrical spaces, such as a corridor. 
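
To make the 60 dB definition concrete, the sketch below estimates RT from an impulse
response using Schroeder backward integration followed by a straight-line fit to the
decay curve. The synthetic RIR (exponentially decaying noise), the sampling rate and
the -5 to -35 dB fitting range are assumptions chosen for illustration; they are not
taken from the studies cited above.

```python
import numpy as np

def schroeder_edc_db(rir):
    """Energy decay curve in dB, obtained by backward integration of rir**2."""
    energy = np.cumsum(rir[::-1] ** 2)[::-1]
    return 10.0 * np.log10(energy / energy[0])

def estimate_rt60(rir, fs, lo_db=-5.0, hi_db=-35.0):
    """Fit a line to the decay curve between lo_db and hi_db, extrapolate to -60 dB."""
    edc = schroeder_edc_db(rir)
    t = np.arange(len(rir)) / fs
    mask = (edc <= lo_db) & (edc >= hi_db)
    slope, _ = np.polyfit(t[mask], edc[mask], 1)  # slope in dB per second
    return -60.0 / slope

# Synthetic late-reverberation-like RIR: noise whose amplitude envelope
# drops by 60 dB after true_rt60 seconds (illustrative values).
fs = 16000
true_rt60 = 0.5
rng = np.random.default_rng(0)
t = np.arange(int(1.5 * fs)) / fs
rir = rng.standard_normal(t.size) * 10.0 ** (-3.0 * t / true_rt60)

rt_est = estimate_rt60(rir, fs)
print(f"estimated RT: {rt_est:.3f} s (target {true_rt60} s)")
```

With measured RIRs, the fitting range usually has to be chosen with the noise floor
in mind, which this toy example ignores.
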

When reproduced binaurally (e.g., through headphones), it has been shown that 
reverberation increases the sense of externalisation, i.e., the illusion of virtual sound 
sources being outside the head, when compared to anechoic sounds [13, 14]. It has 
been suggested that this effect can be achieved even by just adding the early reflec- 
tions [13], while the contribution of late reverberation (>80 ms) is smaller in com- 
parison [15]. Previous studies have looked into the contribution of both monaural and 
binaural cues to the externalisation of reverberant binaural signals. Monaural cues 
have been shown to have limited importance by Hassager et al. [16] and Jiang et al. 
[17], who argued that spectral detail is not as critical in the reverberant sound as it is 
in the direct sound. Regardless, it has been reported that applying spectral correction 
(headphone equalisation) to binaural signals could increase externalisation and other 
subjective attributes when employing headphones with limited reproduction band- 
width [18, 19]. Binaural cues, on the other hand, have been shown to be critical: 
Leclere et al. [14] suggested that reverberation increases externalisation of a binaural 
signal as long as interaural differences are introduced. This is supported by Catic et al. 
[15], who reported a considerable decrease in externalisation when the reverberant 
part of auralised speech was presented diotically. Such effects have been linked to 
specific binaural cues, such as interaural level differences (ILDs) and interaural 
coherence (IC). Recent studies have reported correlations between the level of exter- 
nalisation and the amount of temporal fluctuations of ILDs and IC in the binaural 
signals [14, 15, 20]. Moreover, Li et al. [21, 22] highlighted the importance of rever- 
beration specifically in the contralateral ear signal, showing a stronger contribution to 
externalisation than its ipsilateral counterpart, which is explained by the fact that 
reverberation is proportionally louder on the contralateral side due to the head 
shadow effect. Finally, according to the ‘room divergence effect’, externalisation of 
simulated binaural signals increases when the rendered reverberation matches the 
listener's expectations given their prior knowledge of the room [23-25]. Head move- 
ments and vision also play an important role in spatial audio perception [26], but they 
are not covered here—for a thorough review on sound externalisation, the reader is 
referred to Best et al. [27]. 
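
As a rough illustration of the binaural cues mentioned above, the following sketch
computes a broadband ILD and an interaural coherence value for a two-channel signal.
Here IC is taken as the maximum of the normalised interaural cross-correlation within
a lag range of 1 ms, which is one common convention and an assumption of this example;
the cited studies analyse short-time fluctuations of these cues, which would require
frame-by-frame (and typically band-wise) processing not shown here.

```python
import numpy as np

def ild_db(left, right):
    """Broadband interaural level difference in dB (positive: left ear louder)."""
    return 10.0 * np.log10(np.sum(left ** 2) / np.sum(right ** 2))

def interaural_coherence(left, right, fs, max_lag_ms=1.0):
    """Maximum absolute normalised cross-correlation within +/- max_lag_ms."""
    max_lag = int(round(fs * max_lag_ms / 1000.0))
    norm = np.sqrt(np.sum(left ** 2) * np.sum(right ** 2))
    xcorr = np.correlate(left, right, mode="full")
    centre = len(right) - 1  # index of the zero-lag term
    window = xcorr[centre - max_lag:centre + max_lag + 1]
    return np.max(np.abs(window)) / norm

fs = 16000
rng = np.random.default_rng(0)
noise_a = rng.standard_normal(4000)
noise_b = rng.standard_normal(4000)

# Diotic presentation: the same signal at both ears.
ic_diotic = interaural_coherence(noise_a, noise_a, fs)
ild_diotic = ild_db(noise_a, noise_a)
# Fully decorrelated ears, as an idealised diffuse reverberant tail.
ic_decorrelated = interaural_coherence(noise_a, noise_b, fs)
```

A diotic signal yields an IC of one and an ILD of zero, whereas independent noise at
the two ears yields an IC close to zero.
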
In summary, reverberation greatly influences how a listener perceives an auditory
scene by providing information on the room characteristics, the size and
location of the sound sources and, in the case of binaural simulations, affecting the 
level of externalisation. Consequently, it should be modelled carefully when pro- 
ducing realistic acoustic simulations, although this can prove to be a challenging 
task in real-time systems with limited resources. The next sections of this chapter 
will address the issue of balancing computational efficiency and perceptual quality 
when simulating reverberation. 


2. Simulating reverberation efficiently 


Simulating reverberation can be useful in various applications. In some cases, 
such as music production, it has mainly an aesthetic value and may not require 
highly realistic simulations. In other cases, such as architectural acoustics, aug- 
mented reality (AR) and, to a lesser extent, virtual reality (VR), the goal is to 
recreate a real acoustic space, so reverberation needs to be modelled with sufficient 
accuracy. For instance, an AR system allows the users to perceive the real world 
integrated with a virtual layer, e.g., a videoconferencing application in which users, 
wearing a pair of AR glasses, see holograms of their interlocutors which look and 
sound as if they were in the same room. From the acoustic point of view, this is 
particularly challenging to implement because the listener is exposed to real sound 
sources as well as virtual ones, so the simulated acoustics should be realistic enough 
for the virtual and real sources to be appropriately blended. Even though highly 
realistic reverberation is often desired, it can easily become too expensive to simu- 
late in real time for interactive applications, where the auditory scene is expected to 
vary over time—even more so if many virtual sources are simulated [28]. There- 
fore, it is relevant to explore simplified reverberation models that reduce computa- 
tional costs without compromising quality. 

In the most general case, reverberation is rendered by convolving a dry audio 
signal with a room impulse response (RIR), which is the time-domain acoustic 
transfer function between a sound source and a receiver in a given acoustic space 
(room), assuming that the system formed by these is linear and time-invariant. The 
RIR can be either measured acoustically [29] or in a simulated environment. Several 
simulation techniques have been proposed, which range from rigorous but compu- 
tationally expensive physical models, such as the finite-difference time-domain 
method [30], to simpler but less accurate geometrical models, such as the image 
source method [31] or scattering delay networks [32]. Ray-tracing and cone-tracing 
are also popular techniques that allow for a variable degree of accuracy [28, 33-35], 
although the computational requirements can become rather intensive when sound
sources move in space, and real-time implementations are often limited to very 
simplified models and/or renderings. 
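
The convolution step itself can be sketched in a few lines. The example below
convolves a dry signal with an RIR via the FFT, which is the usual way of keeping the
cost manageable for long responses. The three-tap toy RIR and the random dry signal
are placeholders assumed for illustration; a real-time renderer would typically use a
partitioned (block-wise) variant of the same idea.

```python
import numpy as np

def fft_convolve(dry, rir):
    """Linear convolution of a dry signal with an RIR, computed via the FFT."""
    n = len(dry) + len(rir) - 1          # length of the full linear convolution
    n_fft = 1 << (n - 1).bit_length()    # next power of two, avoids circular wrap
    spectrum = np.fft.rfft(dry, n_fft) * np.fft.rfft(rir, n_fft)
    return np.fft.irfft(spectrum, n_fft)[:n]

rng = np.random.default_rng(1)
dry = rng.standard_normal(2000)          # placeholder anechoic signal

# Toy RIR: direct sound plus two attenuated echoes (illustrative values).
rir = np.zeros(800)
rir[0], rir[300], rir[640] = 1.0, 0.5, -0.25

wet = fft_convolve(dry, rir)
```
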

Reverberation may also be generated through computationally lighter 
‘convolution-less’ methods, such as Schroeder reverberators [36] or feedback delay 
networks (FDN) [37-39]. Such techniques are generally less accurate than 
convolution-based methods but can be useful to efficiently model the less critical 
parts of the RIR such as the late-reverberation tail [40]. 
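
For illustration, a minimal feedback delay network can be written as follows. This is
a generic textbook-style sketch, not the implementation of any of the cited works: the
four delay lengths, the Hadamard feedback matrix and the per-line gains (set so that
each loop decays by 60 dB after the target RT) are all assumptions made for the
example.

```python
import numpy as np

def fdn_impulse_response(n_samples, fs, rt60, delays=(1031, 1327, 1523, 1871)):
    """Impulse response of a 4-line FDN with an orthogonal (Hadamard) feedback matrix."""
    delays = np.asarray(delays)
    feedback = 0.5 * np.array([[1, 1, 1, 1],
                               [1, -1, 1, -1],
                               [1, 1, -1, -1],
                               [1, -1, -1, 1]], dtype=float)
    # Attenuation per pass through each delay line: 60 dB of decay after rt60 seconds.
    gains = 10.0 ** (-3.0 * delays / (fs * rt60))
    buffers = [np.zeros(d) for d in delays]   # circular buffers, one per delay line
    idx = np.zeros(len(delays), dtype=int)
    out = np.zeros(n_samples)
    for n in range(n_samples):
        x = 1.0 if n == 0 else 0.0            # unit impulse at the input
        # Each read returns the sample written delays[i] steps earlier.
        taps = np.array([buffers[i][idx[i]] for i in range(len(delays))])
        out[n] = taps.sum()
        new = feedback @ (gains * taps) + x   # attenuate, mix, add the input
        for i in range(len(delays)):
            buffers[i][idx[i]] = new[i]
            idx[i] = (idx[i] + 1) % delays[i]
    return out

fs = 16000
h = fdn_impulse_response(n_samples=fs, fs=fs, rt60=0.4)
```

The cost per output sample is a handful of multiplications regardless of the length
of the simulated tail, which is what makes this family of methods attractive for late
reverberation.
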

With the goal of finding a balance between computational cost and perceived 
quality, several parametric reverberation models have been proposed [40-47]. Most 
of them aim to alleviate computational costs by rendering early reflections with a 
higher temporal and spatial accuracy than late reverberation, based on the concept of 
mixing time, i.e., the instant after which the RIR does not perceivably change across 
different listeners’ positions or orientations within the room (see Figure 1) [48]. 
An early example of this approach, known as ‘hybrid’ reverberation, was presented
by Murphy and Stewart [40], who proposed to employ convolution-based rendering 
for early reflections and simpler methods (e.g., FDN) to produce late reverberation. A 
key aspect of the hybrid model is correctly establishing the mixing time, which 
depends on the room volume, being longer for larger rooms [48].

In spatial audio applications, it is important to accurately simulate the direction of 
arrival of early reflections (and of late reverberation, to a lesser extent) which adds 
yet another layer of difficulty to the process. This also means that the reproduction 
method should be able to replicate such spatial cues. An example of a playback system 
would be a loudspeaker array surrounding the listener that can simulate virtual 
sources and reflections through amplitude panning [49] or Ambisonics [50]. In the 
case of binaural audio, such systems may be mimicked through virtual loudspeakers, 
but other methods also exist, as discussed in Section 2.1. 

Note that the scope of this chapter covers reverberation’s spatial features from 
the listener’s point of view, but not from the source’s point of view. Therefore, 
sound source directivity is not discussed, even though it is an important topic on its 
own—e.g., it is essential to model it correctly in a six-degrees-of-freedom (6DoF) 
application where the listener is allowed to walk past a directional source [51]. 


2.1 The binaural case 


When rendering reverberation binaurally, directional information of reflected 
sounds is encoded in the binaural room impulse response (BRIR), i.e., a pair of RIRs 
that are measured at the listener’s ear canals, in the form of monaural and interaural 
cues. Therefore, the most effective and straightforward way to achieve an accurate 
binaural rendering is to convolve an anechoic audio signal with a BRIR. Static (non- 
head-tracked) BRIR-based renderings can produce highly authentic binaural sig- 
nals, to the point of being indistinguishable from those emitted by real sound 
sources [52-55]. On the other hand, dynamic (head-tracked) renderings are more 
challenging to implement, as they require swapping between BRIRs as the listener 
or the source moves. It is worth noting that, when dealing with binaural renderings
of anechoic environments, an angular movement of a source relative to the listener 
is roughly equivalent to a head rotation of the listener, which is typically trivial to 
compute in the Ambisonics domain using rotation matrices ([56], Section 5.2.2). 
However, this does not generalise to reverberant environments, where the room 


Figure 1.

First 130 ms of an RIR, expressed in decibels relative to the peak value. The RIR was simulated with the image
source method [31] for an omnidirectional point source placed 10 m away from the receiver in a room with an
approximate volume of 2342.7 m³. The mixing time, estimated according to Lindau et al. [48], is indicated.
[Plot: level in dB (from -40 to 10) against time in ms (from 0 to 120). Legend: direct sound, first-order
reflections, second-order reflections, higher-order reflections, estimated mixing time.]


Reverberation and its Binaural Reproduction: The Trade-off between Computational... 
DOI: http://dx.doi.org/10.5772/intechopen.101940 


provides a frame of reference, and the angular movement of a source is not 
equivalent to rotating the listener’s head. 

A recent study has suggested that BRIRs should be measured by varying the 
listener position in increments of 5 cm or less in a three-dimensional grid (which 
can be a costly process) to achieve a dynamic convolution-based rendering in which 
the swapping is seamless to the listener [57]. Alternatively, one may start from a 
coarser spatial grid and interpolate BRIRs at intermediate positions. Unfortunately, 
BRIR interpolation is not trivial because the time and direction of arrival of each 
reflection may vary depending on the receiver’s position, changing the BRIR’s tem- 
poral structure across the grid. Nevertheless, recent studies have shown promising 
progress by employing dual-band approaches and heuristics to match early reflec- 
tions in the time domain [58, 59]. On a related note, another active research topic is 
the extrapolation of RIRs in the Ambisonics domain for 6DoF applications 
(e.g., [60-63]), which is further discussed in Section 4. 

Although BRIRs are mainly obtained through binaural measurements made on a 
person’s or a mannequin’s head [55], they may also be generated from RIRs that 
were either measured with microphone arrays [64-68] or simulated [28, 35]. This 
approach typically involves identifying individual reflections and their direction of 
arrival, e.g., with the help of the spatial decomposition method (SDM) [65], and 
then convolving each reflection with a head-related impulse response (HRIR) for 
the corresponding direction [69]—which is equivalent to a multiplication with a 
head-related transfer function (HRTF) in the frequency domain. However, render- 
ing the full length of the BRIR this way can easily become expensive, which is why 
simplified models such as the aforementioned ‘hybrid’ one become important: we 
can just render a few early reflections accurately while modelling late reverberation 
as a stochastic, non-directional process, and still produce binaural signals that are 
not perceptually different from properly rendered ones. This has been recently 
shown by Brinkmann et al. [47], who suggested that accurately rendering just six 
early reflections plus stochastic late reverberation may be enough to produce 
auralisations that are perceptually indistinguishable from a fully-rendered refer- 
ence, for a simulation of a shoebox-type room. 
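
The reflection-by-reflection construction described above can be sketched as follows.
Each reflection is treated as a delayed, scaled Dirac pulse, so convolving it with an
HRIR amounts to placing a scaled copy of the HRIR at the reflection’s delay. The
function `toy_hrir` is a hypothetical stand-in for a measured HRIR set (a bare
delay-and-gain model of interaural time and level differences), and the reflection
list is invented for the example; azimuth is assumed positive towards the left.

```python
import numpy as np

def toy_hrir(azimuth_deg, fs, length=64):
    """Hypothetical HRIR pair (left, right): pure ITD/ILD model, not measured data."""
    az = np.deg2rad(azimuth_deg)
    itd = 0.0007 * np.sin(az)             # seconds; positive azimuth: left ear leads
    ild = 0.3 * np.sin(az)                # linear gain offset between the ears
    shift = int(round(0.5 * itd * fs))
    centre = length // 2
    hrir = np.zeros((2, length))
    hrir[0, centre - shift] = 1.0 + ild   # left ear
    hrir[1, centre + shift] = 1.0 - ild   # right ear
    return hrir

def brir_from_reflections(reflections, fs, n_samples):
    """Sum HRIRs placed at each reflection's delay. reflections: (delay_s, gain, az_deg)."""
    brir = np.zeros((2, n_samples))
    for delay_s, gain, azimuth_deg in reflections:
        hrir = toy_hrir(azimuth_deg, fs)
        start = int(round(delay_s * fs))
        stop = min(start + hrir.shape[1], n_samples)
        brir[:, start:stop] += gain * hrir[:, :stop - start]
    return brir

fs = 48000
reflections = [(0.000, 1.0, 30.0),    # direct sound
               (0.004, 0.6, -70.0),   # early reflection from the right
               (0.007, 0.5, 110.0)]   # early reflection from the rear left
brir = brir_from_reflections(reflections, fs, n_samples=1024)
```

The cost of this approach grows with the number of rendered reflections, which is why
restricting it to a few early ones, as in the hybrid model, is attractive.
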

It should be noted that modelling late reverberation as isotropic is computation- 
ally inexpensive but may lead to noticeable degradation when simulating asymmet- 
rical rooms (e.g., a long and narrow corridor) where late reverberation is highly 
directional [12]. For such cases, Alary et al. have proposed employing directional
feedback delay networks (DFDN) [39], which extend the functionality of traditional
FDNs to spatial audio and make it possible to inexpensively produce non-uniform
reverberation in which the RT is direction-dependent. A downside of DFDNs is their
inability to correctly reproduce early reflections, which should be modelled 
separately for best results. 

Another simplification consists in quantising the direction of arrival of reflec- 
tions by ‘snapping’ them to the closest neighbour in a predefined grid. This method 
is explored by Amengual Gari et al. [69], who found that an RIR may be quantised 
to just 14 directions in a Lebedev grid [70] and still be used to render binaural 
signals through SDM without perceptual degradation when compared to the origi- 
nal. The scattering delay network method (SDN) is based on a similar premise, 
quantising the RIR to as many directions as first-order reflections, e.g., six for a 
cuboid room, while obtaining good results in perceptual evaluations [32]. The 
rationale of SDN is that early reflections are computed accurately, while later ones 
are approximated with higher error as time advances, which is a sensible approach 
from a perceptual point of view. However, it might lead to an inaccurate late 
reverberation tail, which is why combining SDN with an inexpensive method for 
late reverberation simulation (e.g., DFDN) might be a promising alternative. 
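
The ‘snapping’ operation amounts to a nearest-neighbour search on the unit sphere, as
sketched below. The grid used here, six axis directions plus eight cube-corner
directions, is how the 14-node Lebedev rule is usually tabulated, but it is written
out by hand as an assumption of this example rather than taken from [69] or [70].

```python
import numpy as np

def lebedev14_directions():
    """Assumed 14-direction grid: 6 axis directions and 8 normalised cube corners."""
    axes = np.vstack([np.eye(3), -np.eye(3)])
    corners = np.array([[x, y, z]
                        for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)],
                       dtype=float) / np.sqrt(3.0)
    return np.vstack([axes, corners])

def snap_to_grid(doas, grid):
    """Map each direction of arrival (rows of doas) to the closest grid direction."""
    unit = doas / np.linalg.norm(doas, axis=1, keepdims=True)
    # The largest dot product between unit vectors is the smallest angle.
    nearest = np.argmax(unit @ grid.T, axis=1)
    return nearest, grid[nearest]

grid = lebedev14_directions()
doas = np.array([[1.0, 0.1, 0.0],     # almost straight ahead along +x
                 [1.0, 1.0, 0.9]])    # close to a cube-corner direction
indices, snapped = snap_to_grid(doas, grid)
```

After quantisation, only as many HRIR pairs as there are grid directions are needed,
however many reflections the RIR contains.
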
On the other hand, rather than generating separate BRIRs for each rendered sound
source, one may also ‘encode’ the sum of all of them into a single sound field, and 
then reproduce it binaurally, e.g., by means of a set of virtual loudspeakers. That way, 
only the virtual loudspeaker signals must be binaurally rendered, independently of 
the number of sources that form the sound field. This is a convenient simplification 
when many sources are rendered at once. As mentioned earlier, typical loudspeaker- 
based sound field reproduction methods include vector-based amplitude panning 
[49] and high-order Ambisonics [50, 56, 71]. The latter is by far the more popular 
method for binaural rendering, given its efficient simulation of head rotations 
([56], Section 5.2.2) and manipulation of spatial resolution [72]. However, the 
Ambisonics processing may have perceivable effects on the binaural signals, which 
are still being investigated. Recent research on this topic is reviewed in Section 3. 


3. Binaural Ambisonics-based reverberation and spatial resolution 


The spherical harmonics framework (known as Ambisonics in the context of 
audio production) allows a sound field to be expressed as a continuous function on a
sphere around the listener. Ambisonics sound fields are typically generated from 
microphone array recordings [73] or plane-wave-based simulations. Alternatively, 
it is often convenient to measure or simulate an Ambisonics RIR that can be con- 
volved with any anechoic audio signal to generate the sound field, e.g., as in [74]. 
Once encoded in the Ambisonics domain, a sound field can be mirrored, warped or 
rotated around the listener through inexpensive algebraic operations [56].
Additionally, it is possible to modify its spatial resolution, which makes it possible
to reduce computational costs in the rendering process in exchange for potential
perceptual degradation [72, 75, 76].
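
As an example of such an algebraic operation, the sketch below rotates a first-order
sound field about the vertical axis with a single 4 x 4 matrix. It assumes the ACN
channel order (W, Y, Z, X), SN3D plane-wave encoding gains and an azimuth measured
counter-clockwise; these conventions are choices made for the example, and
higher-order rotations require larger block-diagonal matrices that are not shown.

```python
import numpy as np

def encode_first_order(azimuth, elevation=0.0):
    """Plane-wave encoding gains, ACN order (W, Y, Z, X), SN3D normalisation."""
    return np.array([1.0,
                     np.sin(azimuth) * np.cos(elevation),
                     np.sin(elevation),
                     np.cos(azimuth) * np.cos(elevation)])

def yaw_rotation_matrix(angle):
    """Matrix that rotates a first-order sound field counter-clockwise by `angle`."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, c, 0.0, s],      # Y' = c * Y + s * X
                     [0.0, 0.0, 1.0, 0.0],  # W and Z are unaffected by yaw
                     [0.0, -s, 0.0, c]])    # X' = c * X - s * Y

source = encode_first_order(np.deg2rad(30.0))
rotated = yaw_rotation_matrix(np.deg2rad(60.0)) @ source
expected = encode_first_order(np.deg2rad(90.0))

# A head rotation by +alpha corresponds to rotating the sound field by -alpha;
# for a multichannel signal of shape (4, n_samples) the same product applies.
head_compensation = yaw_rotation_matrix(-np.deg2rad(60.0))
```

The cost is one small matrix product per block of samples, independent of the number
of sources contained in the sound field.
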

When a sound field is encoded in the Ambisonics domain, its spatial resolution is 
defined by its inherent ‘truncation order’, which is an integer equal to or greater than
zero. Higher-order signals have a larger number of channels and can produce
binaural renderings with finer spatial resolution and sound sources that are easier to 
localise, while lower-order signals are more lightweight (fewer channels) and pro- 
duce renderings with lower resolution and ‘blurry’ sources (see Figure 2). This was 
shown by Avni et al. [77], who argued that truncating the order of an Ambisonics 
signal affected the perception of spaciousness and timbre in the resulting binaural 
signals. Later, Bernschütz [66] reported that, in perceptual evaluations, listeners
could not generally detect differences in binaural signals rendered from Ambisonics 
sound fields of order 11 and above. Then, Ahrens and Andersson [74] showed that 
an order of 8 might be sufficient to simulate lateral sound sources that are indistin- 
guishable from BRIR-based renderings, but slight differences were perceived up to 
order 29 for frontal sound sources. 

It has also been shown that the relation between spatial order and perceived
quality depends on the ‘decoding’ method that is used to translate the
Ambisonics sound field to a pair of binaural signals. For instance, the time-alignment 
method [78] and the magnitude least squares (MagLS) method [79] have both been 
shown to produce more accurate binaural signals at lower spatial orders than other 
approaches, such as the widely used virtual loudspeakers method [80]. In the case of 
MagLS, which focuses on minimising magnitude errors (disregarding phase) at high 
frequencies, Sun [81] showed that a conceptually similar method was able to 
produce binaural signals that were indistinguishable from a high-order reference at 
orders as low as 14. 

Overall, previous studies have suggested that binaural signals can be accurately 
rendered from Ambisonics sound fields as long as the truncation order is high 
Figure 2.

Room impulse response encoded in the Ambisonics domain at different truncation orders (0 to 4), for a source 
placed in front of the listener. Data are plotted as sound pressure (in decibels relative to the peak value) along 
the time axis and over different azimuth angles on the horizontal plane. Source: Engel et al. [76] (‘trapezoid’ 
room). 


enough, probably somewhere between 8 and 29. However, such orders may still be
too high to be computationally efficient (an Ambisonics signal of truncation order N
has (N + 1)² channels, so the channel count grows quadratically with the order) or
just unfeasible in practice (commercially available microphone arrays operate at
order 4 or lower).
The remainder of this section discusses some recent perceptual studies that 
explored how the binaural rendering of reverberant sound fields is affected when 
simplifications are applied in the Ambisonics domain, e.g., reducing the truncation 
order of different parts of the RIR. 
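
To put these orders into perspective, the number of channels of a full
three-dimensional Ambisonics signal of order N is (N + 1)². The short example below
simply evaluates this expression for some of the orders mentioned in this section.

```python
def ambisonics_channels(order):
    """Number of spherical-harmonic channels of a 3D Ambisonics signal."""
    return (order + 1) ** 2

for order in (1, 4, 8, 29):
    print(f"order {order:2d}: {ambisonics_channels(order)} channels")
```

Going from order 4 to order 29 therefore multiplies the channel count, and with it
the number of signals to be processed, by a factor of 36.
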


3.1 Hybrid Ambisonics 


A recent listening experiment by Lübeck et al. [75] showed that early reflections 
and late reverberation may be encoded in Ambisonics at a significantly lower order 
than the direct sound and still produce binaural signals that are indistinguishable 
from a BRIR-based rendering. The reason why this may happen is illustrated in 
Figure 2, which shows an RIR encoded in Ambisonics at different truncation orders. 
It can be seen how the lowest order (0) produces an isotropic signal which does not 
vary across directions in the horizontal plane, while higher orders achieve a more 
faithful representation of the sound field by allowing for spatially ‘sharper’ patterns— 
e.g., note how the direct sound becomes narrower as order increases, converging 
towards a spatial Dirac delta. Looking at this figure, it becomes apparent that earlier 
parts of the RIR (blue) are more sensitive to spatial resolution changes due to order
truncation, compared to late reverberation (green) which is less directional. 

According to these observations, it is reasonable to propose an Ambisonics- 
based binaural rendering method that employs a high truncation order for the direct 
sound (and, possibly, some early reflections) and lower orders for the rest of the 
RIR. Such a method could be highly efficient given that late reverberation usually 
accounts for the majority of the duration of the RIR. This approach, reminiscent of
the hybrid models discussed earlier, has been tentatively termed ‘hybrid
Ambisonics’.
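
One simple way to picture this idea is to keep all channels of an Ambisonics RIR up to
a split point and only the low-order channels afterwards. The sketch below does exactly
that with a hard split; it assumes ACN channel ordering (so that the first (N + 1)²
channels form the order-N signal), uses random numbers as a placeholder RIR, and is not
the procedure followed in the cited studies, which may involve cross-fading or
equalisation steps omitted here.

```python
import numpy as np

def hybrid_truncate(rir_sh, n_split, low_order):
    """Zero every channel above `low_order` from sample `n_split` onwards."""
    keep = (low_order + 1) ** 2              # channels of the low-order signal
    out = np.array(rir_sh, dtype=float, copy=True)
    out[keep:, n_split:] = 0.0
    return out

rng = np.random.default_rng(0)
rir_sh = rng.standard_normal((25, 4800))     # placeholder 4th-order RIR, 100 ms at 48 kHz
n_split = 240                                # 5 ms at 48 kHz (illustrative split point)
hybrid = hybrid_truncate(rir_sh, n_split, low_order=1)

# Channels that still carry signal after the split point.
active_late = np.count_nonzero(np.any(hybrid[:, n_split:] != 0.0, axis=1))
```

In this toy case, only 4 of the 25 channels remain active after the first 5 ms, so
most of the response can be processed at the cost of a first-order signal.
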

A perceptual study by Engel et al. [76] evaluated binaural signals generated with 
hybrid Ambisonics and the virtual loudspeaker method, and found that an order
of 2 or 3 (depending on the room) may be enough to render reverberation,
assuming that the direct sound path is accurately reproduced through convolution 
with HRIRs (see Figure 3). This is a promising precedent for future efficient 
binaural rendering methods, although further investigations would be needed to 
generalise these results to a wider selection of rooms and stimuli types. In the 
future, a more general model could estimate the needed truncation order adaptively 
based on the Ambisonics signal (e.g., measuring its directivity over time), which 
could be used in efficient binaural renderers or as a way to compress spatial audio 
data. 


3.2 Reverberant virtual loudspeaker (RVL) 


In real-time interactive binaural simulations, RIRs are typically recomputed 
when there is a change in the scene such as movements of the listener or sources. 
When working in the Ambisonics domain, this recomputation is not needed in
order to simulate a head rotation of the listener, as the signal can be efficiently
rotated via a rotation matrix ([56], Section 5.2.2). However, translational movements
of either the listener or a source still require the RIRs to be recomputed. As a
result, the number of sources that can be rendered simultaneously in a low-cost 
scenario might be limited. 

In such cases, it may be beneficial to employ a rendering method that scales well 
with the number of sources. One such example is the reverberant virtual loud- 
speaker method (RVL), an Ambisonics-based approach that has the advantage of 
requiring a fixed number of real-time convolutions regardless of the number of
sources [72, 76, 83]. This method takes inspiration from the virtual loudspeakers 
approach [71, 80], which decodes an Ambisonics sound field to a virtual loud- 
speaker grid around the listener and convolves the resulting signals with HRIRs to 
generate the binaural output. RVL performs this same process but, instead of 
HRIRs, the virtual loudspeaker signals are convolved with BRIRs, so the acoustics of 
the room are effectively integrated with the binaural rendering without the need for 
additional steps. Therefore, the number of real-time convolutions depends only on 
the truncation order of the sound field, independently of the number of rendered 
Figure 3.

Perceptual ratings of binaural renderings generated from the hybrid Ambisonics RIRs of orders 0 to 4 shown
in Figure 2, where the direct sound was reproduced via convolution with a single HRIR. A dry rendering was
used as the anchor signal and the 4th-order signal as the reference. The vertical dotted lines indicate that the
groups on the left are significantly different (p < 0.05) from the groups on the right. Source: Engel et al. [76].
sources. For this reason, RVL is highly efficient at rendering a large number of
sources in real time (see Figure 4). Its main limitation is that the room is head- 
locked due to the set of BRIRs being fixed, so head rotations may lead to inaccurate 
reflections, as shown in Figure 5. 
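
The scaling behaviour of RVL can be illustrated with a simplified sketch of the signal
flow: all sources are encoded into one Ambisonics signal, decoded to a fixed set of
virtual loudspeakers, and each loudspeaker feed is convolved with a BRIR pair. The
horizontal-only first-order encoding, the four-loudspeaker layout, the sampling-style
decoder and the random placeholder BRIRs are assumptions of this example and do not
reproduce the implementation evaluated in [76].

```python
import numpy as np

def encode(azimuth):
    """Horizontal first-order encoding gains (W, Y, X) for a source at `azimuth`."""
    return np.array([1.0, np.sin(azimuth), np.cos(azimuth)])

def rvl_render(sources, azimuths, decoder, brirs):
    """sources: (S, T), decoder: (L, C), brirs: (L, 2, K). Returns (output, n_conv)."""
    # 1) Encode every source and sum them into a single Ambisonics signal (C, T).
    gains = np.stack([encode(a) for a in azimuths], axis=1)
    ambi = gains @ sources
    # 2) Decode to the L virtual loudspeaker feeds (L, T).
    feeds = decoder @ ambi
    # 3) Convolve each feed with the BRIR pair of its loudspeaker and sum per ear.
    out = np.zeros((2, sources.shape[1] + brirs.shape[2] - 1))
    n_conv = 0
    for feed, brir in zip(feeds, brirs):
        for ear in range(2):
            out[ear] += np.convolve(feed, brir[ear])
            n_conv += 1
    return out, n_conv

rng = np.random.default_rng(0)
ls_az = np.deg2rad([45.0, 135.0, 225.0, 315.0])             # virtual loudspeakers
decoder = np.stack([encode(a) for a in ls_az]) / len(ls_az)  # illustrative decoder
brirs = rng.standard_normal((4, 2, 256)) * np.exp(-np.arange(256) / 50.0)  # placeholders

sources = rng.standard_normal((5, 1000))
src_az = np.deg2rad([0.0, 30.0, 90.0, 180.0, 270.0])
binaural, n_conv = rvl_render(sources, src_az, decoder, brirs)
_, n_conv_single = rvl_render(sources[:1], src_az[:1], decoder, brirs)
```

Both calls perform eight convolutions (two ears times four loudspeakers), whether one
or five sources are rendered; only the inexpensive encoding step grows with the number
of sources.
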

RVL was perceptually evaluated in [76], paying particular attention to its effect 
on head rotations. For the assessment, the method was applied only to the rever- 
berant sound (direct sound was generated through convolution with HRIRs) and 
the implementation was done with the 3D Tune-In Toolkit spatial audio library 
[84]. Listeners were asked to compare RVL to first-order hybrid Ambisonics ren- 
derings (both head-tracked) of speech and music, by being asked ‘Considering the 
given room [shown in a picture], which example is more appropriate?’. Results 
suggested that the inaccurate head rotations could indeed be detected by listeners 
but were not necessarily perceived as a degradation in quality with respect to the 
more accurate rendering—note the bimodal distribution shown in Figure 6, which 
indicates that there was not a unanimous preference towards either rendering. 

One could speculate that the RVL method was preferred by some listeners due to 
the BRIR-based rendering leading to highly uncorrelated binaural signals, which are 
typically associated with higher perceived quality when evaluating late reverbera- 
tion (see the binaural quality index by Beranek [9]). An additional investigation to 
explore the matter further would be to compare the RVL method to other 
approaches that specifically aim to optimise interaural coherence, such as the 
Figure 4.

Comparison between the average execution time of the convolution stage in Ambisonics binaural rendering 
(‘standard’) and RVL binaural rendering, as a function of the number of rendered sources, for two different 
reverberation times (RT). A random signal with a length of 1024 samples was used as input. Simulations
were done in MATLAB (MathWorks) using the overlap-add method [82], running on a quad-core processor at 
2.8 GHz. Source: Engel et al. [76]. 


Figure 5. 

Direct sound path and first-order early reflections as they reach the left ear of a listener in three scenarios: (left)
before any head rotation; (middle) canonical rendering after a head rotation of 30 degrees clockwise; and (right) 
RVL rendering after the same head rotation. Note how, in the third scenario, the direct sound path is accurate, 
whereas the room is head-locked, affecting the incoming direction of reflections. Source: Engel et al. [76]. 
Figure 6.

Violin plot showing perceptual ratings from paired comparisons between first-order hybrid Ambisonics (A) and 
RVL (B) binaural renderings. Negative values represent preference towards A, while positive values represent
preference towards B. Source: Engel et al. [76]. 


covariance constraint method proposed by Zaunschirm et al. [78] and described by 
Zotter and Frank ([56], Section 4.11.3). 

Regardless, further perceptual evaluations (e.g., in more rooms) would be 
needed to generalise these results. Overall, RVL could be a viable option to render 
binaural reverberation of a large number of sources in real time in a low-resource 
scenario. 


4. Future directions 


The trade-off between complexity and perceived quality when rendering binau- 
ral reverberation is still an area of major interest that has to be further explored. 
Recent studies have looked at the perceptual impact of varying spatial resolution of 
Ambisonics-based reverberation, but there are still aspects of it that warrant further
research. For instance, it would be interesting to explore an approach to compress 
Ambisonics RIRs by truncating their order depending on their directional and 
temporal information, as a way to compute and store them more efficiently. 
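
As a rough illustration of the compression idea above, the following minimal sketch truncates the order of an Ambisonics RIR as a function of time only; a directional criterion is not modelled. It assumes ACN channel ordering, under which channel k belongs to order floor(sqrt(k)) and an order-N signal has (N + 1)^2 channels. The function names and the breakpoint schedule are hypothetical and are not taken from any of the cited studies.

```python
import math


def acn_order(channel):
    """Ambisonics order of a channel index, assuming ACN ordering."""
    return math.isqrt(channel)


def truncate_order_over_time(rir, schedule):
    """Zero the channels above a time-dependent maximum order.

    rir: list of channels (ACN order), each a list of samples.
    schedule: (start_sample, max_order) pairs sorted by start_sample.
              These breakpoints are hypothetical, chosen by the caller.
    Samples before the first breakpoint keep their full order.
    """
    out = [list(ch) for ch in rir]
    n_samples = len(rir[0])
    for i, (start, max_order) in enumerate(schedule):
        end = schedule[i + 1][0] if i + 1 < len(schedule) else n_samples
        for k, ch in enumerate(out):
            if acn_order(k) > max_order:
                for n in range(start, end):
                    ch[n] = 0.0
    return out


def stored_samples(n_channels, n_samples, schedule):
    """Number of samples that remain to be stored after truncation."""
    if not schedule:
        return n_channels * n_samples
    total = n_channels * min(schedule[0][0], n_samples)
    for i, (start, max_order) in enumerate(schedule):
        end = schedule[i + 1][0] if i + 1 < len(schedule) else n_samples
        total += min(n_channels, (max_order + 1) ** 2) * (end - start)
    return total


# Toy third-order RIR (16 channels, 8 samples): full order for the first
# two samples, first order for the next three, zeroth order afterwards.
toy_rir = [[1.0] * 8 for _ in range(16)]
toy_schedule = [(0, 3), (2, 1), (5, 0)]
toy_truncated = truncate_order_over_time(toy_rir, toy_schedule)
toy_count = stored_samples(16, 8, toy_schedule)
```

In a real implementation the zeroed segments would simply not be stored or convolved. Where to place the breakpoints without audible degradation is precisely the open perceptual question.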

Another set of highly relevant challenges will come from using artificial binaural 
reverberation in different contexts and tasks. For example, binaural audio has been 
used in the past to help blind individuals learn about the spatial configuration 
of a closed environment before being physically introduced to it [85, 86]. 
Within that context, the creation of geometrically and spatially accurate real-time 
reverberation was extremely important and could be achieved only by performing a 
series of case-specific optimisations in the processing chain, for example, limiting 
navigation paths to a series of lines rather than a two-dimensional space, and 
pre-calculating a set of Ambisonics RIRs so that only rotations and interpolations 
had to be computed in real time. Such optimisations are acceptable only within a 
research environment; therefore, real-life applications of such techniques are 
currently very limited. A better understanding of both the computational and 
perceptual sides of reverberation, possibly specifically for blind and visually 
impaired individuals, could lead to major advancements in the development and use 
of auditory displays and assistive technologies, tools and devices. 

Looking ahead, AR applications could offer an interesting testbed for further 
research on binaural reverberation perception and rendering. One of the key 
research areas in AR/VR is 6DoF (or position-dynamic) audio rendering, where the 
listener is allowed to move around the scene, as opposed to traditional Ambisonics 
rendering where only head rotations are allowed (three degrees of freedom). 
Several methods have recently been proposed to efficiently extrapolate spatial audio 
signals from one listener position to another, either via simple parametric methods 
[87] or more complex Ambisonics-based approaches that often rely on 
parametrising the sound field in ‘direct’ and ‘ambient’ components [60, 61], or 
according to the source distance [62, 63]. Significant advancements have also been 
made in recording complex auditory scenes and making them navigable in 6DoF; in 
this case, specialised hardware and software have been released and are 
already available commercially [88]. Future improvements in 6DoF recording and 
rendering techniques will in turn allow for an increased level of interactivity within 
the simulation, as well as more effective evaluations of different audio rendering 
technologies using AR/VR systems. 
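
To make the rotation-only (three degrees of freedom) case concrete, the sketch below counter-rotates a single first-order Ambisonics frame for a head turn about the vertical axis. It is a minimal illustration under stated assumptions: ACN channel order (W, Y, Z, X), azimuth measured counterclockwise from the front, and a hypothetical function name. Pitch, roll, higher orders and listener translation (the 6DoF case) are not covered.

```python
import math


def rotate_foa_for_head_yaw(w, y, z, x, head_yaw_rad):
    """Counter-rotate one first-order Ambisonics frame for a head yaw.

    Channels are assumed to follow ACN order (W, Y, Z, X).
    head_yaw_rad is positive for a counterclockwise (leftward) head turn,
    so the sound field is rotated by the opposite angle.
    """
    theta = -head_yaw_rad
    c, s = math.cos(theta), math.sin(theta)
    # Only X and Y mix under a rotation about the vertical axis;
    # W (omnidirectional) and Z (vertical) are unchanged.
    x_rot = c * x - s * y
    y_rot = s * x + c * y
    return w, y_rot, z, x_rot


# A frontal source (X = 1, Y = 0) after the listener turns 90 degrees to
# the left should appear on the listener's right (negative Y).
w_r, y_r, z_r, x_r = rotate_foa_for_head_yaw(1.0, 0.0, 0.0, 1.0, math.pi / 2)
```

Because X and Y share the same normalisation at first order, this yaw rotation is the same for SN3D and N3D signals.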

Focusing on the AR case, in order to blend real with virtual audio, it is essential 
to develop techniques for the automatic estimation of the reverberant 
characteristics of the real environment. New methods will need to be developed and 
evaluated for blending virtual audio sources within real scenes, and the impact of 
blending accuracy will need to be assessed through metrics related to perceived 
realism and scene acceptability. This can be achieved, for example, by 
characterising the acoustical environment surrounding the AR user and using this 
in-situ data to synthesise virtual sounds with matching acoustic properties. 
Machine learning (ML) techniques could be employed to address the issue of blind 
acoustical environment characterisation by focusing first on overall room 
fingerprint evaluation (late reverberation), then on the finer details of the room 
response that vary depending on specific source positions (early reflections). The 
scene analysis could also be used to extract the direction-of-arrival of multiple 
sound sources and the direct-to-reverberant energy ratio by separating source 
information from room and user acoustic properties. The data extracted by the model 
could then be employed to generate realistic virtual reverberation that matches the 
real-world reverberation. Of course, for each step of this scenario, several open 
challenges still exist, both from the computational point of view (e.g., how to 
generate geometrically and directionally accurate reverberation in real-time) and 
from the perceptual point of view (e.g., what is perceptually relevant and should 
therefore be computationally modelled and rendered, and what can be approximated). 
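
For reference, the direct-to-reverberant energy ratio mentioned above can be computed directly when the room impulse response is known. The sketch below shows this non-blind computation, which is the quantity a blind estimator would try to recover from a running signal. The function name is hypothetical, and the 2.5 ms half-window around the direct-sound peak is an assumed convention rather than a value taken from this chapter.

```python
import math


def direct_to_reverberant_ratio_db(rir, fs, window_ms=2.5):
    """Direct-to-reverberant energy ratio (dB) of a known impulse response.

    rir: list of samples of a single-channel room impulse response.
    fs: sampling rate in Hz.
    window_ms: half-width of the window around the direct-sound peak that
               is counted as direct energy (an assumed convention).
    """
    # Take the largest absolute sample as the direct sound arrival.
    peak = max(range(len(rir)), key=lambda n: abs(rir[n]))
    half = int(round(window_ms * 1e-3 * fs))
    lo = max(0, peak - half)
    hi = min(len(rir), peak + half + 1)
    direct = sum(v * v for v in rir[lo:hi])
    # Everything after the direct window counts as reverberant energy.
    reverberant = sum(v * v for v in rir[hi:])
    if reverberant == 0.0:
        return math.inf
    return 10.0 * math.log10(direct / reverberant)


# Toy response: a unit direct sound followed by two reflections of 0.5.
toy_drr = direct_to_reverberant_ratio_db(
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.5, 0.5], fs=1000, window_ms=1.0
)
```

In the toy response the direct energy is 1 and the reverberant energy is 0.5, so the ratio is 10 log10(2), roughly 3 dB.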

Better understanding the extent and origin of sensory thresholds in 
reverberation perception therefore still presents a very open set of challenges, 
which will need to be addressed in the future through extensive listening 
experiments and, potentially, also by means of binaural auditory models and 
ML-trained ‘artificial listeners’. 


5. Conclusions 


Within this chapter, an overview was presented on the perception and efficient 
simulation of reverberation. A special focus has been put on the case of binaural 
audio and, in particular, on Ambisonics-based and convolution-based rendering 
methods. The issue of the trade-off between computational cost and perceived 
quality has been discussed at length, mainly looking at the case of varying spatial 
resolution and implementation choices of Ambisonics-based renderings, and 
highlighting the results of some recent studies on this matter. Considering the very 
rapid development and uptake of VR and AR technologies, the importance of further 
research on how computational optimisations and simplifications affect the 
perceived quality and realism of the rendering is particularly evident. Some of the 
most relevant challenges in this area have been outlined at the end of the chapter, 
and will hopefully serve as a guideline for future research in the area. 